Abstract

Statistical models of natural stimuli provide an important tool for researchers in 
the fields of machine learning and computational neuroscience. A canonical way to 
quantitatively assess and compare the performance of statistical models is given by 
the likelihood. One class of statistical models which has recently gained increasing 
popularity and has been applied to a variety of complex data are deep belief networks. 
Analyses of these models, however, have been typically limited to qualitative analyses 
based on samples due to the computationally intractable nature of the model likelihood. 
Motivated by these circumstances, the present article provides a consistent estimator 
for the likelihood that is both computationally tractable and simple to apply in practice. 
Using this estimator, a deep belief network which has been suggested for the modeling 
of natural image patches is quantitatively investigated and compared to other models of 
natural image patches. Contrary to earlier claims based on qualitative results, the 
results presented in this article provide evidence that the model under investigation is 
not a particularly good model for natural images. 



1 Introduction 

Statistical models of naturally occurring stimuli constitute an important tool in machine 
learning and computational neuroscience, among many other areas. In machine learning, 
they have been applied both to supervised and unsupervised problems, such as denoising 
(e.g., Lyu and Simoncelli, 2007), classification (e.g., Lee et al., 2009) or prediction (e.g., 
Doretto et al., 2003). In computational neuroscience, statistical models have been used 
to analyze the structure of natural images as part of the quest to understand the tasks 
faced by the visual system of the brain (e.g., Lewicki and Doi, 2005; Olshausen and Field, 
1996). Other examples include generative statistical models studied to derive better models 
of perceptual learning. Underlying these approaches is the assumption that the low-level 
areas of the brain adapt to the statistical structure of their sensory inputs and are less 
concerned with goals of behavior. 

Figure 1: Left: Natural image patches sampled from the van Hateren dataset (van Hateren
and van der Schaaf, 1998). Right: Filters learned by a deep belief network trained
on whitened image patches.

An important measure to assess the performance of a statistical model is the likelihood,
which allows us to objectively compare the density estimation performance of different
models. Given two model instances with the same prior probability and a test set of data
samples, the ratio of their likelihoods already tells us everything we need to know to decide
which of the two models is more likely to have generated the dataset. Furthermore, for
densities $p$ and $\tilde{p}$, the negative expected log-likelihood represents the cross-entropy (or
expected log-loss) term of the Kullback-Leibler (KL) divergence,

$$D_{\mathrm{KL}}[\tilde{p} \,\|\, p] = -\sum_x \tilde{p}(x) \log p(x) - H[\tilde{p}],$$

which is always non-negative and zero if and only if $p$ and $\tilde{p}$ are identical. The main
motivation for the KL-divergence stems from coding theory, where the cross-entropy represents
the coding cost of encoding samples drawn from $\tilde{p}$ with a code that would be optimal for
samples drawn from $p$. Correspondingly, the KL-divergence represents the additional coding
cost created by using an optimal code which assumes the distribution of the samples to
be $p$ instead of $\tilde{p}$. Finally, the likelihood allows us to directly examine the success of
maximum likelihood learning for different training settings. Unfortunately, for many interesting
models, the likelihood is intractable to compute exactly.
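The decomposition of the KL-divergence into cross-entropy minus entropy can be checked on a toy example. The following is a minimal sketch; the three-state distributions and all function names are illustrative assumptions and are not taken from the article.

```python
import math

def entropy(p):
    """Shannon entropy H[p] in nats of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p_data, p_model):
    """Negative expected log-likelihood of p_model under p_data, in nats."""
    return -sum(pd * math.log(pm) for pd, pm in zip(p_data, p_model) if pd > 0)

def kl_divergence(p_data, p_model):
    """D_KL[p_data || p_model] written as cross-entropy minus entropy."""
    return cross_entropy(p_data, p_model) - entropy(p_data)

# Toy distributions over three states (assumed values for illustration).
p_data = [0.5, 0.25, 0.25]
p_model_a = [0.4, 0.4, 0.2]
p_model_b = [0.5, 0.3, 0.2]

kl_a = kl_divergence(p_data, p_model_a)
kl_b = kl_divergence(p_data, p_model_b)
print(kl_a, kl_b)
```

Because the entropy term does not depend on the model, ranking models by expected log-likelihood is the same as ranking them by KL-divergence to the data distribution.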

One such class of models which has attracted a lot of attention in recent years is given 
by deep belief networks. Deep belief networks are hierarchical generative models introduced 
by Hinton et al. (Hinton and Salakhutdinov, 2006) together with a greedy learning rule 
as an approach to the long-standing challenge of training deep neural networks, that is, 
hierarchical neural networks with many layers. In supervised tasks, they have been shown 
to learn representations which can be successfully employed in classification tasks, such 
as character recognition (Hinton et al., 2006a) and speech recognition (Mohamed et al., 
2009). In unsupervised tasks, where the likelihood is particularly important, they have 
been applied to a wide variety of complex datasets, such as patches of natural images
(Osindero and Hinton, 2008; Ranzato et al., 2010; Ranzato and Hinton, 2010; Lee and Ng,
2007), motion capture recordings (Taylor et al., 2007) and images of faces (Susskind et al.,
2008).

When applied to natural images, deep belief networks have been shown to develop 
biologically plausible features (Lee and Ng, 2007) and samples from the model were shown 
to adhere to certain statistical regularities also found in natural images (Osindero and 
Hinton, 2008). Examples of natural image patches and features learned by a deep belief 
network are presented in Figure 1. 

In this article, after reviewing the relevant aspects of deep belief networks, we will
derive a consistent estimator for their likelihood and demonstrate the estimator's applicability
in practice by evaluating a model trained on natural image patches. After a thorough 
quantitative analysis, we will argue that the deep belief network under consideration is 
not a particularly good model for estimating the density of small natural image patches, 
as it is outperformed with respect to the likelihood even by simple mixture models. Fur- 
thermore, we will show that adding layers to the network has only a small effect on the 
overall performance of the model if each layer is trained well enough and will offer a possi- 
ble explanation for this observation by analyzing a best-case scenario of the greedy learning 
procedure commonly used for training deep belief networks. 



2 Models

In this section we will review the statistical models used in the remainder of this article.
In particular, we will describe the restricted Boltzmann machine (RBM) and some of its 
variants which constitute the main building blocks for constructing deep belief networks 
(DBNs). Furthermore, we will discuss some of the models' properties relevant for estimating 
the likelihood of DBNs. Readers familiar with DBNs might want to skip this section or 
skim it to get acquainted with the notation. 

Throughout this article, the goal of applying statistical models is assumed to be the
approximation of a particular distribution of interest, often called the data distribution.
We will denote this distribution by $\tilde{p}$.

2.1 Boltzmann Machines 

A Boltzmann machine is a potentially fully connected undirected graphical model (or Markov
random field) with binary random variables. Its probability mass function is a Boltzmann
distribution over $2^k$ binary states $s \in \{0, 1\}^k$ which is defined in terms of an energy
function $E$,

$$q(s) = \frac{1}{Z} \exp(-E(s)), \qquad Z = \sum_s \exp(-E(s)), \tag{1}$$

where $E$ is given by

$$E(s) = -\frac{1}{2} s^\top W s - b^\top s = -\frac{1}{2} \sum_{i,j} s_i w_{ij} s_j - \sum_i b_i s_i \tag{2}$$

and depends on a symmetric weight matrix $W \in \mathbb{R}^{k \times k}$ with zeros on the diagonal, i.e.
$w_{ii} = 0$ for all $i = 1, \dots, k$, and bias terms $b \in \mathbb{R}^k$. $Z$ is called the partition function
and ensures the normalization of $q$. In the following, unnormalized distributions will be
marked with an asterisk:

$$q^*(s) = Z q(s) = \exp(-E(s)).$$

Figure 2: Boltzmann machines with different constraints on their connectivity. Filled nodes
denote observed variables, unfilled nodes denote hidden variables. A: A fully connected
latent-variable Boltzmann machine. B: A restricted Boltzmann machine
forming a bipartite graph. C: A semi-restricted Boltzmann machine, which in
contrast to RBMs also allows connections between the visible units.

Samples from the Boltzmann distribution can be obtained via Gibbs sampling, which operates
by conditionally sampling each univariate random variable until some convergence
criterion is reached. From definitions (1) and (2) it follows that the conditional probability
of a unit $i$ being on given the states of all other units $s_{j \neq i}$ is given by

$$q(s_i = 1 \mid s_{j \neq i}) = g\Big(\sum_j w_{ij} s_j + b_i\Big),$$

where $g(x) = 1/(1 + \exp(-x))$ is the sigmoidal logistic function. The Boltzmann machine
can be seen as a stochastic generalization of the binary Hopfield network, which is based
on the same energy function but updates its units deterministically using a step function,
i.e. a unit is set to 1 if $\sum_j w_{ij} s_j + b_i > 0$ and set to 0 otherwise. In the limit of increasingly
large weight magnitudes, the logistic function becomes a step function and the deterministic
behavior of the Hopfield network can be recovered with the Boltzmann machine (Hinton,
2007).
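The Gibbs update described above can be sketched for a very small Boltzmann machine. This is an illustrative example only: the weights, biases, number of sweeps and all function names are assumptions, and the exact marginals are computed by enumeration purely as a sanity check, which is feasible only because the model has three units.

```python
import itertools
import math
import random

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def energy(s, W, b):
    """E(s) = -1/2 s^T W s - b^T s for a symmetric W with zero diagonal."""
    k = len(s)
    pair = sum(W[i][j] * s[i] * s[j] for i in range(k) for j in range(k))
    return -0.5 * pair - sum(b[i] * s[i] for i in range(k))

def gibbs_sweep(s, W, b, rng):
    """Resample every unit in turn from its conditional given all other units."""
    k = len(s)
    for i in range(k):
        # w_ii = 0, so including j = i in the sum changes nothing.
        a = sum(W[i][j] * s[j] for j in range(k)) + b[i]
        s[i] = 1 if rng.random() < logistic(a) else 0
    return s

def exact_marginals(W, b):
    """Exact q(s_i = 1) by enumerating all 2^k states."""
    k = len(b)
    states = list(itertools.product([0, 1], repeat=k))
    weights = [math.exp(-energy(s, W, b)) for s in states]
    Z = sum(weights)
    return [sum(w for s, w in zip(states, weights) if s[i] == 1) / Z
            for i in range(k)]

# Toy parameters (assumed values).
W = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.5],
     [-0.5, 0.5, 0.0]]
b = [0.1, -0.2, 0.3]

rng = random.Random(0)
s = [0, 0, 0]
burn_in, num_sweeps = 1000, 20000
counts = [0, 0, 0]
for t in range(burn_in + num_sweeps):
    gibbs_sweep(s, W, b, rng)
    if t >= burn_in:
        for i in range(3):
            counts[i] += s[i]

empirical = [cnt / num_sweeps for cnt in counts]
exact = exact_marginals(W, b)
print(empirical, exact)
```

For such a small model the empirical marginals from the chain should agree closely with the enumerated ones; for larger models only the sampling part remains feasible.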

Of particular interest for building DBNs are latent variable Boltzmann machines, that is, 
Boltzmann machines for which the states s are only partially observed (Figure 2). We will 
refer to states of observed — or visible — random variables as x and to states of unobserved — 
or hidden — random variables as y, such that s can be written as s = (x, y). 

Approximation of the data distribution $\tilde{p}(x)$ with the model distribution $q(x)$ via
maximum likelihood (ML) learning can be implemented by following the gradient of the model
log-likelihood. In Boltzmann machines, this gradient is conceptually simple yet
computationally very hard to evaluate. The gradient of the expected log-likelihood with respect to
some parameter $\theta$ of the energy function is (Salakhutdinov, 2009):

$$\frac{\partial}{\partial\theta} \mathbb{E}_{\tilde{p}(x)}[\log q(x)] = \mathbb{E}_{q(x, y)}\left[\frac{\partial}{\partial\theta} E(x, y)\right] - \mathbb{E}_{\tilde{p}(x) q(y \mid x)}\left[\frac{\partial}{\partial\theta} E(x, y)\right]. \tag{3}$$



The first term on the right-hand side of this equation is the expected gradient of the energy 
function when both hidden and observed states are sampled from the model, while the 
second term is the expected gradient of the energy function when the hidden states are 
drawn from the conditional distribution of the model, given a visible state drawn from the 
data distribution. Following this gradient increases the energy of the states which are more 
likely with respect to the model distribution and decreases the energy of the states which 
are more likely with respect to the data distribution. Remember that by the definition of 
the model (1), states with higher energy are less likely than states with lower energy. 

As an example, the gradient of the log-likelihood with respect to the weight connecting
a visible unit $x_i$ and a hidden unit $y_j$ becomes

$$\mathbb{E}_{\tilde{p}(x) q(y \mid x)}[x_i y_j] - \mathbb{E}_{q(x, y)}[x_i y_j].$$



A step in the direction of this gradient can be interpreted as a combination of Hebbian and 
anti-Hebbian learning (Hinton, 2003), where the first term corresponds to Hebbian learning 
and the second term to anti-Hebbian learning, respectively. 

Evaluating the expectations, however, is computationally intractable for all but the 
simplest networks. Even approximating the expectations with Monte Carlo methods is 
typically very slow (Long and Servedio, 2010). Two measures can be taken to make learning 
in Boltzmann machines feasible: constraining the Boltzmann machine in some way, or 
replacing the likelihood with a simpler objective function. The former approach led to the 
introduction of RBMs, which will be discussed in the next section. The latter approach 
led to the now widely used contrastive divergence (CD) learning rule (Hinton, 2002) which 
represents a tractable approximation to ML learning: In CD learning, the expectation over
the model distribution $q(x, y)$ is replaced by an expectation over

$$q_{\mathrm{CD}}(x, y) = \sum_{x_0, y_1} \tilde{p}(x_0)\, q(y_1 \mid x_0)\, q(x \mid y_1)\, q(y \mid x),$$

from which samples are obtained by taking a sample $x_0$ from the data distribution, updating
the hidden units, updating the visible units, and finally updating the hidden units again,
while in each step keeping the respective set of other variables fixed. This corresponds
to a single sweep of Gibbs sampling through all random variables of the model plus an
additional update of the hidden units. If instead $n$ sweeps of Gibbs sampling are used, the
learning procedure is generally referred to as CD($n$) learning. For $n \to \infty$, ML learning is
regained (Salakhutdinov, 2009).
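The CD procedure can be illustrated with a rough sketch of a single CD(1) update. The sketch assumes the bipartite energy function $E(x, y) = -x^\top W y - b^\top x - c^\top y$ of a restricted Boltzmann machine, for which both conditionals factorize. The data set, learning rate, initialization and all function names are hypothetical, and hidden probabilities are used in place of sampled hidden states when accumulating statistics, which is a common variance-reduction choice rather than something prescribed by the text.

```python
import itertools
import math
import random

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_probs(x, W, c):
    """q(y_j = 1 | x) for every hidden unit j."""
    return [logistic(sum(W[i][j] * x[i] for i in range(len(x))) + c[j])
            for j in range(len(c))]

def visible_probs(y, W, b):
    """q(x_i = 1 | y) for every visible unit i."""
    return [logistic(sum(W[i][j] * y[j] for j in range(len(y))) + b[i])
            for i in range(len(b))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(x0, W, b, c, lr, rng):
    """One CD(1) step: data x0 -> hidden -> visible x1 -> hidden, then update."""
    p_y0 = hidden_probs(x0, W, c)
    y0 = sample(p_y0, rng)
    x1 = sample(visible_probs(y0, W, b), rng)
    p_y1 = hidden_probs(x1, W, c)
    for i in range(len(b)):
        for j in range(len(c)):
            # Data-driven (Hebbian) term minus model-driven (anti-Hebbian) term.
            W[i][j] += lr * (x0[i] * p_y0[j] - x1[i] * p_y1[j])
        b[i] += lr * (x0[i] - x1[i])
    for j in range(len(c)):
        c[j] += lr * (p_y0[j] - p_y1[j])

def log_q_star(x, W, b, c):
    """Unnormalized log-marginal of the visible units of a binary RBM."""
    total = sum(b[i] * x[i] for i in range(len(b)))
    for j in range(len(c)):
        a = sum(W[i][j] * x[i] for i in range(len(b))) + c[j]
        total += math.log1p(math.exp(a))
    return total

def avg_log_likelihood(data, W, b, c):
    """Exact average log q(x); feasible only because the model is tiny."""
    log_Z = math.log(sum(math.exp(log_q_star(x, W, b, c))
                         for x in itertools.product([0, 1], repeat=len(b))))
    return sum(log_q_star(x, W, b, c) - log_Z for x in data) / len(data)

rng = random.Random(0)
data = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]  # toy data
W = [[rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(4)]
b = [0.0] * 4
c = [0.0] * 2

ll_before = avg_log_likelihood(data, W, b, c)
for epoch in range(200):
    for x0 in data:
        cd1_update(x0, W, b, c, 0.05, rng)
ll_after = avg_log_likelihood(data, W, b, c)
print(ll_before, ll_after)
```

The exact log-likelihood is computed here only to monitor training on a four-unit toy problem; it is exactly this quantity that becomes intractable for models of realistic size.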

2.2 Restricted Boltzmann Machines 

A restricted Boltzmann machine (RBM) (Smolensky, 1986) is a Boltzmann machine whose
energy function is constrained such that no direct interaction between two visible units or
two hidden units is possible,

$$E(x, y) = -x^\top W y - b^\top x - c^\top y.$$

The corresponding graph has no connections between the visible units and no connections
between the hidden units and is hence bipartite (Figure 2). The weight matrix $W \in \mathbb{R}^{m \times n}$
differs from the one in equation (2) in that it now only contains interaction terms between
the $m$ visible units and the $n$ hidden units and therefore no longer needs to be symmetric or
constrained in any other way. Despite these constraints it has been shown that RBMs are
universal approximators, i.e. for any distribution over binary states and any $\varepsilon > 0$, there
exists an RBM whose KL-divergence to that distribution is smaller than $\varepsilon$ (Roux and
Bengio, 2008).

In an RBM, the visible units are conditionally independent given the states of the hidden
units and vice versa. The conditional distribution of the hidden units, for instance, is given
by

$$q(y \mid x) = \prod_j q(y_j \mid x), \qquad q(y_j = 1 \mid x) = g(w_j^\top x + c_j),$$

where $g$ is the logistic sigmoidal function and $w_j$ is the $j$-th column of $W$. This allows
for efficient Gibbs sampling of the model distribution (since one set of variables can be
updated in parallel given the other) and thus for faster approximation of the log-likelihood
gradient. Moreover, the unnormalized marginal distributions $q^*(x)$ and $q^*(y)$ of RBMs can
be computed analytically by integrating out the respective other variable. For instance, the
unnormalized marginal distribution of the visible units becomes

$$q^*(x) = \exp(b^\top x) \prod_j \left(1 + \exp(w_j^\top x + c_j)\right). \tag{4}$$
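The closed form of the unnormalized marginal can be verified numerically against a brute-force sum over all hidden states. The following sketch uses assumed toy parameters and hypothetical function names; the brute-force sum is only practical for a handful of hidden units.

```python
import itertools
import math

def energy(x, y, W, b, c):
    """RBM energy: minus the interaction term and minus both bias terms."""
    inter = sum(x[i] * W[i][j] * y[j]
                for i in range(len(x)) for j in range(len(y)))
    return (-inter
            - sum(bi * xi for bi, xi in zip(b, x))
            - sum(cj * yj for cj, yj in zip(c, y)))

def log_q_star_x(x, W, b, c):
    """Analytic log q*(x): b^T x plus a softplus term for every hidden unit."""
    total = sum(bi * xi for bi, xi in zip(b, x))
    for j in range(len(c)):
        a = sum(W[i][j] * x[i] for i in range(len(x))) + c[j]
        total += math.log1p(math.exp(a))
    return total

def log_q_star_x_brute(x, W, b, c):
    """log of the sum of exp(-E(x, y)) over all 2^n hidden states y."""
    return math.log(sum(math.exp(-energy(x, y, W, b, c))
                        for y in itertools.product([0, 1], repeat=len(c))))

# Toy RBM with 4 visible and 3 hidden units (assumed values).
W = [[0.5, -0.3, 0.2],
     [-0.4, 0.1, 0.6],
     [0.3, 0.7, -0.5],
     [-0.2, 0.4, 0.1]]
b = [0.1, -0.2, 0.3, 0.0]
c = [0.2, -0.1, 0.0]

max_diff = max(abs(log_q_star_x(x, W, b, c) - log_q_star_x_brute(x, W, b, c))
               for x in itertools.product([0, 1], repeat=4))
print(max_diff)
```

The analytic form costs a number of operations proportional to the number of weights, whereas the brute-force sum grows exponentially in the number of hidden units.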

Two other models which can be used for constructing DBNs are the Gaussian RBM (GRBM)
(Salakhutdinov, 2009) and the semi-restricted Boltzmann machine (SRBM) (Osindero and
Hinton, 2008). The GRBM employs continuous instead of binary visible units (while keeping
the hidden units binary) and can thus be used to model continuous data. Its energy
function is given by

$$E(x, y) = \frac{1}{2\sigma} \|x - b\|^2 - \frac{1}{\sigma} x^\top W y - c^\top y. \tag{5}$$

A somewhat more general definition allows a different $\sigma$ for each individual visible unit
(Salakhutdinov, 2009). As for the binary Boltzmann machine, training of the GRBM proceeds
by following the gradient given in equation (3), or an approximation thereof. Its
properties are similar to those of an RBM, except that its conditional distribution $q(x \mid y)$
is a multivariate Gaussian distribution whose mean is determined by the hidden units,

$$q(x \mid y) = \mathcal{N}(x; Wy + b, \sigma I).$$

Each state of the hidden units encodes one mean, while $\sigma$ controls the variance of each
Gaussian and is the same for all states of the hidden units. The GRBM can therefore be
interpreted as a mixture of an exponential number of Gaussian distributions with fixed,
isotropic covariance and parameter sharing constraints.

In an SRBM, only the hidden units are constrained to have no direct connections to each
other while the visible units are unconstrained (Figure 2). Importantly, analytic expressions
are therefore only available for $q^*(x)$ but not for $q^*(y)$. Furthermore, $q(x \mid y)$ is no longer
factorial. For efficiently training DBNs, conditional independence of the hidden units is
more important than conditional independence of the visible units (Osindero and Hinton,
2008). This is in part due to the fact that the second term on the right-hand side of equation
(3) can still efficiently be evaluated if the hidden units are conditionally independent, and
in part due to the way inference is done in DBNs.

2.3 Deep Belief Networks 

Figure 3: A graphical model representation of a two-layer deep belief network composed of
two RBMs. Note that the connections of the first layer are directed.

DBNs (Hinton and Salakhutdinov, 2006) are hierarchical generative models composed
of several layers of RBMs or one of their generalizations. While DBNs have been widely
used as part of a heuristic for learning multiple layers of feature representations and for
pretraining multi-layer perceptrons (by initializing the multi-layer perceptron with the
parameters learned by a DBN), the existence of an efficient learning rule has also made
them attractive for density estimation tasks.

For simplicity, we will begin by defining a two-layer DBN. Let $q(x, y)$ and $r(y, z)$ be
the densities of two RBMs over visible states $x$ and hidden states $y$ and $z$. Then, the joint
probability mass function of a two-layer DBN is defined to be

$$p(x, y, z) = q(x \mid y)\, r(y, z). \tag{6}$$

Interestingly, the resulting distribution is best described not as a deep Boltzmann machine,
as one might expect, or even an undirected graphical model, but as a graphical model
with undirected connections between $y$ and $z$ and directed connections between $x$ and $y$
(Figure 3). This characteristic of the model becomes evident in the generative process. A
sample from the model can be drawn by first Gibbs sampling the distribution $r(y, z)$ of the
top layer to produce a state for $y$. Afterwards, a sample is drawn from the much simpler
distribution $q(x \mid y)$.
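The generative process just described can be sketched as follows, assuming both layers are binary RBMs. All parameter values, the fixed number of Gibbs sweeps and the function names are illustrative assumptions; in practice the top-layer chain would have to be run until some convergence criterion is met.

```python
import math
import random

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_cols(rows, W, col_bias, rng):
    """Sample the column-side units of W given the row-side units."""
    return [1 if rng.random() < logistic(
                sum(W[i][j] * rows[i] for i in range(len(rows))) + col_bias[j])
            else 0
            for j in range(len(col_bias))]

def sample_rows(cols, W, row_bias, rng):
    """Sample the row-side units of W given the column-side units."""
    return [1 if rng.random() < logistic(
                sum(W[i][j] * cols[j] for j in range(len(cols))) + row_bias[i])
            else 0
            for i in range(len(row_bias))]

def sample_dbn(Wq, bq, Wr, br, cr, num_gibbs, rng):
    """Draw one visible state x from a two-layer DBN.

    Wq, bq: weights and visible biases of the first-layer RBM q(x, y).
    Wr, br, cr: weights, y-biases and z-biases of the top RBM r(y, z).
    """
    # Gibbs sampling in the undirected top layer r(y, z).
    y = [rng.randint(0, 1) for _ in br]
    for _ in range(num_gibbs):
        z = sample_cols(y, Wr, cr, rng)
        y = sample_rows(z, Wr, br, rng)
    # Single directed step from y to x.
    return sample_rows(y, Wq, bq, rng)

# Toy parameters (assumed values): 3 visible units, 3 units y, 2 units z.
Wq = [[0.5, -0.3, 0.2], [-0.4, 0.1, 0.6], [0.3, 0.7, -0.5]]
bq = [0.1, -0.2, 0.3]
Wr = [[0.4, -0.6], [0.2, 0.5], [-0.3, 0.1]]
br = [0.0, 0.3, -0.2]
cr = [0.1, -0.4]

rng = random.Random(0)
x = sample_dbn(Wq, bq, Wr, br, cr, 100, rng)
print(x)
```

Note that the hidden biases of the first layer do not appear in the sampler: once the prior over $y$ is supplied by the top layer, only the conditional of the visible units given $y$ is taken from the first-layer RBM.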

The definition can easily be extended to DBNs with three or more layers by replacing
$r(y, z) = r(y \mid z)\, r(z)$ with $r(y \mid z)\, s(z)$, where $s(z)$ is the marginal distribution of another
RBM. Thus, by adding additional layers to the DBN, the prior distribution over the top-level
hidden units ($r(z)$ for the model defined in equation (6)) is effectively replaced with a new
prior distribution, in this case $s(z)$. DBNs with an arbitrary number of layers, like RBMs,
have been shown to be universal approximators even if the number of hidden units in each
layer is fixed to the number of visible units (Sutskever and Hinton, 2008). As mentioned
earlier, another possibility to generalize DBNs is to allow for more general models as layers.
One such model is the SRBM, which can model more complex interactions by having less
restrictive independence assumptions. Alternatively, one could allow for models with units
whose conditional probability distributions are not just binary, but can be any exponential
family distribution (Welling et al., 2005), one instance being the GRBM.

The greedy learning procedure (Hinton et al., 2006a) used for training DBNs starts by
fitting the first-layer RBM (the one closest to the observations) to the data distribution.
Afterwards, the prior distribution over the hidden units defined by the first layer,
$q(y) = \sum_x q(x, y)$, is replaced by the marginal distribution of the second layer, $r(y)$, and the
parameters of the second layer are trained by optimizing a lower bound to the log-likelihood
of the two-layer DBN. In the following, we will derive this lower bound.

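Let $\theta$ be a parameter of $r$. The gradient of the log-likelihood with respect to $\theta$ is

$$\begin{aligned}
\frac{\partial}{\partial\theta} \log p(x)
&= \frac{1}{p(x)} \sum_y \frac{\partial}{\partial\theta}\, p(x, y) \\
&= \frac{1}{p(x)} \sum_y p(x, y) \frac{\partial}{\partial\theta} \log p(x, y) \\
&= \sum_y p(y \mid x) \frac{\partial}{\partial\theta} \log\big(r(y)\, q(x \mid y)\big) \\
&= \sum_y p(y \mid x) \left(\frac{\partial}{\partial\theta} \log r(y) + \frac{\partial}{\partial\theta} \log q(x \mid y)\right) \\
&= \sum_y p(y \mid x) \frac{\partial}{\partial\theta} \log r(y).
\end{aligned} \tag{7}$$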



Approximate ML learning could therefore in principle be implemented by training the second
layer to approximate the posterior distribution $p(y \mid x)$ using CD learning or a similar
algorithm. However, exact sampling from the posterior distribution $p(y \mid x)$ is difficult, as
its evaluation involves integrating over an exponential number of states,

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{q(x \mid y)\, r(y)}{\sum_{y'} q(x \mid y')\, r(y')}.$$

In order to make the training feasible again, the posterior distribution is replaced by the
factorial distribution $q(y \mid x)$. Training the DBN in this manner optimizes a variational
lower bound on the log-likelihood,

$$\begin{aligned}
\log p(x) &= \log \sum_y q(x \mid y)\, r(y) \\
&= \log \sum_y q(y \mid x) \frac{q(x)}{q(y)}\, r(y) && (8) \\
&\geq \sum_y q(y \mid x) \log r(y) + \mathrm{const}, && (9)
\end{aligned}$$

where (8) follows from Bayes' theorem, (9) is due to Jensen's inequality and const is constant
in $\theta$, as only $r$ depends on $\theta$. Taking the derivative of (9) with respect to $\theta$ yields (7) with the
posterior distribution $p(y \mid x)$ replaced by $q(y \mid x)$. The greedy learning procedure can be
generalized to more layers by training each additional layer to approximate the distribution
obtained by conditionally sampling from each layer in turn, starting with the lowest layer.

3 Likelihood Estimation 

In this section, we will discuss the problem of estimating the likelihood of a two-layer DBN
with joint density

$$p(x, y, z) = q(x \mid y)\, r(y, z), \tag{10}$$

that is, for a given visible state $x$, to estimate the value of

$$p(x) = \sum_{y, z} q(x \mid y)\, r(y, z).$$

As we will see later, this problem can easily be generalized to more layers. As before, $q(x, y)$
and $r(y, z)$ refer to the densities of two RBMs.

Two difficulties arise when dealing with this problem in the context of DBNs. First,
$r(y, z)$ depends on a partition function $Z_r$ whose exact evaluation requires integration over
an exponential number of states. Second, despite our ability to integrate analytically over
$z$, even computing just the unnormalized likelihood still requires integration over an
exponential number of hidden states $y$,

$$p^*(x) = \sum_y q(x \mid y)\, r^*(y).$$

After briefly reviewing previous approaches to resolving these difficulties, we will propose an
unbiased estimator for $p^*(x)$, its contribution being a possible solution to the second problem,
and discuss how to construct a consistent estimator for $p(x)$ based on this estimator.
Finally, we will demonstrate its applicability to more general DBNs.

3.1 Previous Work 

3.1.1 Annealed Importance Sampling 

Salakhutdinov and Murray (2008) have shown how annealed importance sampling (AIS)
(Neal, 2001) can be used to estimate the partition function of a restricted Boltzmann
machine. Since our estimator will also rely on AIS estimates of the partition function, we will
briefly describe the procedure here.

Importance sampling is a Monte Carlo method for unbiased estimation of expectations
(MacKay, 2003) and is based on the following observation: Let $s$ be a density with $s(x) > 0$
whenever $q^*(x) > 0$ and let $w(x) = \frac{q^*(x)}{s(x)}$, then

$$\sum_x q^*(x) f(x) = \sum_x s(x) \frac{q^*(x)}{s(x)} f(x) = \mathbb{E}_{s(x)}[w(x) f(x)] \tag{11}$$

for any function $f(x)$. $s$ is called the proposal distribution and $w(x)$ is called the importance
weight. For $f(x) = 1$, we get

$$\mathbb{E}_{s(x)}[w(x)] = \sum_x s(x) \frac{q^*(x)}{s(x)} = Z_q. \tag{12}$$

Estimates of the partition function $Z_q$ can therefore be obtained by drawing samples $x^{(n)}$
from a proposal distribution and averaging the resulting importance weights $w(x^{(n)})$. It
was pointed out in (Minka, 2005) that minimizing the variance of the importance sampling
estimate of the partition function (12) is equivalent to minimizing an $\alpha$-divergence (footnote 1)
between the proposal distribution $s$ and the true distribution $q$. Therefore, for the estimate
to work well in practice, $s$ should be both close to $q$ and easy to sample from.
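A small numerical illustration of equation (12) is given below. The unnormalized target, the two proposal distributions and the function name are assumptions chosen for the example; the second proposal is proportional to the target and therefore yields importance weights that all equal the partition function.

```python
import random

def estimate_Z(q_star, s, num_samples, rng):
    """Average the importance weights q*(x) / s(x) over samples x ~ s."""
    states = list(range(len(q_star)))
    total = 0.0
    for _ in range(num_samples):
        x = rng.choices(states, weights=s)[0]
        total += q_star[x] / s[x]
    return total / num_samples

# Unnormalized target over five states (toy values); its sum is Z_q.
q_star = [1.0, 2.0, 4.0, 2.0, 1.0]
Z_exact = sum(q_star)

rng = random.Random(0)

# A uniform proposal gives an unbiased but noisy estimate.
s_uniform = [0.2] * 5
Z_hat_uniform = estimate_Z(q_star, s_uniform, 20000, rng)

# A proposal proportional to the target makes every weight equal to Z_q.
s_matched = [v / Z_exact for v in q_star]
Z_hat_matched = estimate_Z(q_star, s_matched, 100, rng)

print(Z_exact, Z_hat_uniform, Z_hat_matched)
```

The matched proposal is of course unavailable in practice, since constructing it requires the very partition function being estimated; it only marks the ideal that a good proposal should approach.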

Annealed importance sampling (Neal, 2001) tries to circumvent some of the problems
associated with finding a suitable proposal distribution. Assume we can construct a
distribution $s_1$ which approximates $q$ well, but which is still difficult to sample from or which we
can only evaluate up to a normalization factor. Let $s_2$ be another distribution. This
distribution will effectively act as a proposal distribution for $s_1$. Further, let $T_1$ be a transition
operator which leaves the distribution of $s_1$ invariant, i.e. let $T_1(x_0; x_1)$ be a probability
distribution over $x_0$ depending on $x_1$, such that

$$s_1(x_0) = \sum_{x_1} s_1(x_1)\, T_1(x_0; x_1).$$

We then have

$$\begin{aligned}
Z_q &= \sum_{x_0} s_1(x_0) \frac{q^*(x_0)}{s_1(x_0)} \\
&= \sum_{x_0} \sum_{x_1} s_1(x_1)\, T_1(x_0; x_1) \frac{q^*(x_0)}{s_1(x_0)} \\
&= \sum_{x_0} \sum_{x_1} s_2(x_1)\, T_1(x_0; x_1) \frac{s_1^*(x_1)}{s_2(x_1)} \frac{q^*(x_0)}{s_1^*(x_0)}.
\end{aligned}$$

Note that we don't have to know the partition function of $s_1$ to evaluate the right-hand
term. Also note that we don't need to sample from $s_1$ but only from $T_1$ if we want to
estimate this term via Monte Carlo integration. If $s_2$ is still too difficult to handle, we can
apply the same trick again by introducing a third distribution $s_3$ and a transition operator
$T_2$ for $s_2$. By induction, we can see that

$$Z_q = \sum_x s_n(x_{n-1})\, T_{n-1}(x_{n-2}; x_{n-1}) \cdots T_1(x_0; x_1)\, \frac{s_{n-1}^*(x_{n-1})}{s_n(x_{n-1})} \cdots \frac{q^*(x_0)}{s_1^*(x_0)},$$

where the sum integrates over all $x = (x_0, \dots, x_{n-1})$. Hence, in order to estimate the partition
function, we can draw independent samples $x_{n-1}$ from a simple distribution $s_n$, use the
transition operators to generate intermediate samples $x_{n-2}, \dots, x_0$, and use the product of
fractions in the preceding equation to compute importance weights, which we then average.



Footnote 1: With $\alpha = 2$. $\alpha$-divergences are a generalization of the KL-divergence.

In order to be able to apply AIS to RBMs, a sequence of intermediate distributions and
corresponding annealing weights is defined:

$$s_k^*(x) = q^*(x)^{1 - \beta_k}\, s(x)^{\beta_k}, \qquad \beta_k \in [0, 1],$$

for $k = 0, \dots, n$, where $\beta_0 = 0$ and $\beta_n = 1$. If we also choose an RBM for $s$, then $s_k$ is itself
a Boltzmann distribution whose energy function is a weighted sum of the energy functions
of $s$ and $q$. Similarly, natural and efficient implementations based on Gibbs sampling can
be found for the transition operators $T_k$.
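The annealing scheme can be sketched for a tiny RBM as follows. This is not the implementation of Salakhutdinov and Murray (2008): as simplifying assumptions, the intermediate distributions are defined on the visible marginal $q^*(x)$ only, the proposal $s$ is uniform over the visible states, the annealing schedule is linear, and each transition operator is a single-site Gibbs sweep computed from ratios of $s_k^*$. All parameter values and function names are hypothetical.

```python
import itertools
import math
import random

def log_q_star(x, W, b, c):
    """Analytic unnormalized log-marginal of the visible units of a binary RBM."""
    total = sum(bi * xi for bi, xi in zip(b, x))
    for j in range(len(c)):
        a = sum(W[i][j] * x[i] for i in range(len(x))) + c[j]
        total += math.log1p(math.exp(a))
    return total

def log_s_star(x, beta, W, b, c):
    """log s_k*(x) = (1 - beta) log q*(x) + beta log s(x), with s uniform."""
    return (1.0 - beta) * log_q_star(x, W, b, c) - beta * len(x) * math.log(2.0)

def gibbs_transition(x, beta, W, b, c, rng):
    """Single-site Gibbs sweep that leaves the distribution s_k invariant."""
    x = list(x)
    for i in range(len(x)):
        x[i] = 1
        log_on = log_s_star(x, beta, W, b, c)
        x[i] = 0
        log_off = log_s_star(x, beta, W, b, c)
        p_on = 1.0 / (1.0 + math.exp(log_off - log_on))
        x[i] = 1 if rng.random() < p_on else 0
    return x

def ais_estimate_Z(W, b, c, num_temps, num_runs, rng):
    """Average of AIS importance weights, an estimate of Z_q."""
    betas = [k / num_temps for k in range(num_temps + 1)]  # beta_0 = 0, beta_n = 1
    total = 0.0
    for _ in range(num_runs):
        x = [rng.randint(0, 1) for _ in range(len(b))]      # draw from s_n = s
        log_w = 0.0
        for k in range(num_temps - 1, -1, -1):
            # Weight factor s_k*(x) / s_{k+1}*(x) at the current state.
            log_w += (log_s_star(x, betas[k], W, b, c)
                      - log_s_star(x, betas[k + 1], W, b, c))
            if k > 0:
                x = gibbs_transition(x, betas[k], W, b, c, rng)
        total += math.exp(log_w)
    return total / num_runs

# Toy RBM with 4 visible and 3 hidden units (assumed values).
W = [[0.5, -0.3, 0.2],
     [-0.4, 0.1, 0.6],
     [0.3, 0.7, -0.5],
     [-0.2, 0.4, 0.1]]
b = [0.1, -0.2, 0.3, 0.0]
c = [0.2, -0.1, 0.0]

rng = random.Random(0)
Z_hat = ais_estimate_Z(W, b, c, 30, 200, rng)
Z_exact = sum(math.exp(log_q_star(x, W, b, c))
              for x in itertools.product([0, 1], repeat=4))
print(Z_hat, Z_exact)
```

The exact value is available here only because the model has sixteen visible states; the comparison is meant as a sanity check of the weighting scheme, not as evidence about the behavior of AIS on models of realistic size.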

3.1.2 Estimating Lower Bounds 

In (Salakhutdinov and Murray, 2008) it was also shown how estimates of a lower bound on
the log-likelihood,

$$\begin{aligned}
\log p(x) &\geq \sum_y q(y \mid x) \log \frac{r^*(y)\, q(x \mid y)}{q(y \mid x)} - \log Z_r && (13) \\
&= \sum_y q(y \mid x) \log\big(r^*(y)\, q(x \mid y)\big) + H[q(y \mid x)] - \log Z_r, && (14)
\end{aligned}$$

can be obtained, provided the partition function $Z_r$ is given. This is the same lower bound
as the one optimized during greedy learning (9). Since $q(y \mid x)$ is factorial, the entropy
$H[q(y \mid x)]$ can be computed analytically. The only term which still needs to be estimated
is the first term on the right-hand side of equation (14). This was achieved in (Salakhutdinov
and Murray, 2008) by drawing samples from $q(y \mid x)$.
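For a DBN small enough to enumerate, the bound can be evaluated exactly and compared with the true log-likelihood. The sketch below makes several assumptions: both layers are binary RBMs with hypothetical toy parameters, the partition function of the top layer is computed by enumeration, and the expectation over the hidden states is summed exactly instead of being estimated from samples as in the cited work.

```python
import itertools
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_marginal_star(v, W, bias_v, bias_h):
    """Unnormalized log-marginal of the row-side units v of an RBM."""
    total = sum(bv * vi for bv, vi in zip(bias_v, v))
    for j in range(len(bias_h)):
        a = sum(W[i][j] * v[i] for i in range(len(v))) + bias_h[j]
        total += math.log1p(math.exp(a))
    return total

def log_bernoulli(state, probs):
    """Log-probability of a binary state under independent Bernoulli units."""
    return sum(math.log(p if s == 1 else 1.0 - p) for s, p in zip(state, probs))

# First layer q(x, y) and second layer r(y, z), toy parameters (assumed).
Wq = [[0.5, -0.3, 0.2], [-0.4, 0.1, 0.6], [0.3, 0.7, -0.5]]
bq = [0.1, -0.2, 0.3]
cq = [0.2, -0.1, 0.0]
Wr = [[0.4, -0.6], [0.2, 0.5], [-0.3, 0.1]]
br = [0.0, 0.3, -0.2]
cr = [0.1, -0.4]

states = list(itertools.product([0, 1], repeat=3))
log_Z_r = math.log(sum(math.exp(log_marginal_star(y, Wr, br, cr)) for y in states))

def probs_y_given_x(x):
    return [logistic(sum(Wq[i][j] * x[i] for i in range(3)) + cq[j]) for j in range(3)]

def probs_x_given_y(y):
    return [logistic(sum(Wq[i][j] * y[j] for j in range(3)) + bq[i]) for i in range(3)]

def log_p(x):
    """Exact log-likelihood of the two-layer DBN by enumeration over y."""
    terms = [log_bernoulli(x, probs_x_given_y(y)) + log_marginal_star(y, Wr, br, cr)
             for y in states]
    return math.log(sum(math.exp(t) for t in terms)) - log_Z_r

def lower_bound(x):
    """Right-hand side of (14), with the expectation summed exactly."""
    probs = probs_y_given_x(x)
    entropy = -sum(p * math.log(p) + (1.0 - p) * math.log(1.0 - p) for p in probs)
    expected = sum(math.exp(log_bernoulli(y, probs))
                   * (log_marginal_star(y, Wr, br, cr)
                      + log_bernoulli(x, probs_x_given_y(y)))
                   for y in states)
    return expected + entropy - log_Z_r

gaps = [log_p(x) - lower_bound(x) for x in states]
print(gaps)
```

Each gap equals the KL-divergence between the factorial distribution over the hidden states and the true posterior, so it is non-negative and vanishes only when the two coincide.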

3.1.3 Consistent Estimates 

In (Murray and Salakhutdinov, 2009), carefully designed Markov chains were constructed
to give unbiased estimates for the inverse posterior probability $\frac{1}{p(y \mid x)}$ of some fixed hidden
state $y$. These estimates were then used to get unbiased estimates of $p^*(x)$ by taking
advantage of the fact that $p^*(x) = \frac{p^*(x, y)}{p(y \mid x)}$. The corresponding partition function was estimated
using AIS, leading to an overall estimate of the likelihood that tends to overestimate the
true likelihood. While the estimator was constructed in such a way that even very short
runs of the Markov chain result in unbiased estimates of $p^*(x)$, even a single step of the
Markov chain is slow compared to sampling from $q(y \mid x)$ as it was done for the estimation
of the lower bound (13).

3.2 A New Estimator for DBNs 

The estimator we will introduce in this section shares the same formal properties as the 
estimator proposed in (Murray and Salakhutdinov, 2009), but will utilize samples drawn 
from q(y \ x). This will make it conceptually as simple and as easy to apply in practice as 
the estimator for the lower bound (13), while providing us with consistent estimates of p(x).

3.2.1 Definition



Let $p(x, y, z)$ be the joint density of a DBN as defined in equation (10). By applying Bayes'
theorem, we obtain

$$\begin{aligned}
p(x) &= \sum_y q(x \mid y)\, r(y) && (15) \\
&= \sum_y q(y \mid x) \frac{q(x)}{q(y)}\, r(y) && (16) \\
&= \sum_y q(y \mid x) \frac{q^*(x)}{q^*(y)} \frac{r^*(y)}{Z_r}. && (17)
\end{aligned}$$

An obvious choice for an estimator of $p(x)$ is then

$$\hat{p}_N(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{q^*(x)}{q^*(y^{(n)})} \frac{r^*(y^{(n)})}{Z_r} = \frac{q^*(x)}{Z_r} \frac{1}{N} \sum_{n=1}^{N} \frac{r^*(y^{(n)})}{q^*(y^{(n)})}, \tag{18}$$

where $y^{(n)} \sim q(y^{(n)} \mid x)$ for $n = 1, \dots, N$. For RBMs, the unnormalized marginals $q^*(x)$, $q^*(y)$
and $r^*(y)$ can be computed analytically (4). Note that the partition function $Z_r$ only has
to be calculated once for all visible states we wish to evaluate. Intuitively, the estimation
process can be imagined as first assigning a basic value to $x$ using the distribution of the
first layer, and then with every sample adjusting this value depending on how the second
layer distribution relates to the first layer distribution.
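The estimator can be sketched for a two-layer DBN that is small enough to check against the exact likelihood. The parameters, sample size and function names below are illustrative assumptions, and the partition function of the top layer is computed by enumeration here, whereas for realistic models it would have to be estimated, for example with AIS.

```python
import itertools
import math
import random

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_marginal_star(v, W, bias_v, bias_h):
    """Unnormalized log-marginal of the row-side units v of an RBM."""
    total = sum(bv * vi for bv, vi in zip(bias_v, v))
    for j in range(len(bias_h)):
        a = sum(W[i][j] * v[i] for i in range(len(v))) + bias_h[j]
        total += math.log1p(math.exp(a))
    return total

# First layer q(x, y) and second layer r(y, z), toy parameters (assumed).
Wq = [[0.5, -0.3, 0.2], [-0.4, 0.1, 0.6], [0.3, 0.7, -0.5]]
bq = [0.1, -0.2, 0.3]
cq = [0.2, -0.1, 0.0]
Wr = [[0.4, -0.6], [0.2, 0.5], [-0.3, 0.1]]
br = [0.0, 0.3, -0.2]
cr = [0.1, -0.4]
Wq_T = [list(col) for col in zip(*Wq)]  # lets y play the row-side role

states = list(itertools.product([0, 1], repeat=3))
log_Z_r = math.log(sum(math.exp(log_marginal_star(y, Wr, br, cr)) for y in states))

def log_ratio(y):
    """log r*(y) - log q*(y), the per-sample correction term."""
    return (log_marginal_star(y, Wr, br, cr)
            - log_marginal_star(y, Wq_T, cq, bq))

def estimate_p(x, num_samples, rng):
    """q*(x) / Z_r times the average of r*(y) / q*(y) over y ~ q(y | x)."""
    probs = [logistic(sum(Wq[i][j] * x[i] for i in range(3)) + cq[j])
             for j in range(3)]
    total = 0.0
    for _ in range(num_samples):
        y = [1 if rng.random() < p else 0 for p in probs]
        total += math.exp(log_ratio(y))
    return math.exp(log_marginal_star(x, Wq, bq, cq) - log_Z_r) * total / num_samples

def exact_p(x):
    """Exact p(x) by enumerating all hidden states y."""
    total = 0.0
    for y in states:
        log_q_x_given_y = 0.0
        for i in range(3):
            p = logistic(sum(Wq[i][j] * y[j] for j in range(3)) + bq[i])
            log_q_x_given_y += math.log(p if x[i] == 1 else 1.0 - p)
        total += math.exp(log_q_x_given_y
                          + log_marginal_star(y, Wr, br, cr) - log_Z_r)
    return total

rng = random.Random(0)
x = (1, 0, 1)
p_hat = estimate_p(x, 20000, rng)
p_exact = exact_p(x)
print(p_hat, p_exact)
```

Sampling the hidden states requires only a single feed-forward pass through the first layer, which is what makes the estimator as cheap to apply as the lower bound.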



3.2.2 Properties 



Under the assumption that the partition function $Z_r$ is known, $\hat{p}_N(x)$ provides an unbiased
estimate of $p(x)$ since the sample average is always an unbiased estimate of the expectation.
However, $Z_r$ is generally intractable to compute exactly so that approximations become
necessary. In fact, in (Long and Servedio, 2010) it was shown that already approximating
the partition function of an RBM to within a multiplicative factor is generally NP-hard in
the number of parameters of the RBM.

If in the estimate (18), $Z_r$ is replaced by an unbiased estimate $\hat{Z}_r$, then the overall
estimate will tend to overestimate the true likelihood,

$$\mathbb{E}\left[\frac{\hat{p}^*_N(x)}{\hat{Z}_r}\right] = \mathbb{E}\left[\frac{1}{\hat{Z}_r}\right] \mathbb{E}\left[\hat{p}^*_N(x)\right] \geq \frac{1}{\mathbb{E}[\hat{Z}_r]}\, \mathbb{E}\left[\hat{p}^*_N(x)\right] = \frac{p^*(x)}{Z_r} = p(x),$$

where $\hat{p}^*_N(x) = Z_r\, \hat{p}_N(x)$ is an unbiased estimate of the unnormalized density. The second
step is a consequence of Jensen's inequality and the averages are taken with respect to $\hat{p}^*_N(x)$
and $\hat{Z}_r$, which are independent; $x$ is held fixed.

While the estimator is no longer unbiased once $Z_r$ is replaced by an unbiased estimate of
the partition function, it still retains its consistency. Since $\hat{p}^*_N(x)$ is unbiased for all
$N \in \mathbb{N}$, it is also asymptotically unbiased,

$$\operatorname*{plim}_{N \to \infty} \hat{p}^*_N(x) = p^*(x).$$

Furthermore, if $\hat{Z}_{r,N}$ for $N \in \mathbb{N}$ is a consistent sequence of estimators for the partition
function, it follows that

$$\operatorname*{plim}_{N \to \infty} \frac{\hat{p}^*_N(x)}{\hat{Z}_{r,N}} = \frac{\operatorname*{plim}_{N \to \infty} \hat{p}^*_N(x)}{\operatorname*{plim}_{N \to \infty} \hat{Z}_{r,N}} = \frac{p^*(x)}{Z_r} = p(x).$$

Unbiased and consistent estimates of $Z_r$ can be obtained using AIS (Salakhutdinov and
Murray, 2008). Note that although the estimator tends to overestimate the true likelihood
in expectation and is unbiased in the limit, it is still possible for it to underestimate the
true likelihood most of the time. This behavior can occur if the distribution of estimates is
heavily skewed.
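The effect of Jensen's inequality can be made concrete with a two-point example. The numbers are purely illustrative assumptions: the partition function estimate is taken to be either half or one and a half times the true value with equal probability, which makes it unbiased.

```python
# True partition function and unnormalized density (toy values).
Z = 10.0
p_star = 2.0
p_true = p_star / Z

# An unbiased estimator of Z with two equally likely outcomes.
Z_hat_values = [0.5 * Z, 1.5 * Z]
mean_Z_hat = sum(Z_hat_values) / len(Z_hat_values)

# Expected value of the likelihood estimate p*(x) / Z_hat.
expected_estimate = sum(p_star / z for z in Z_hat_values) / len(Z_hat_values)

print(mean_Z_hat, p_true, expected_estimate)
```

Although the estimate of the partition function is correct on average, the resulting likelihood estimate is too large on average, because underestimates of the partition function inflate the ratio more than overestimates deflate it.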

Another question which remains is whether the estimator is good in terms of efficiency,
or in other words: How many samples are required before a reliable estimate of the true
likelihood is achieved? To address this question, we reformulate the expectation in equation
(17) to give

$$p(x) = \sum_y q(y \mid x) \frac{p(x, y)}{q(y \mid x)}.$$

In this formulation it becomes evident that estimating $p(x)$ is equivalent to estimating the
partition function of $p(y \mid x)$ using importance sampling. To see this, notice that, for a
fixed $x$, $p(x, y)$ is just an unnormalized version of $p(y \mid x)$, where $p(x)$ is the normalization
constant,

$$p(y \mid x) = \frac{p(x, y)}{p(x)}.$$

The proposal distribution in this case is $q(y \mid x)$. As mentioned earlier, the efficiency of
importance sampling estimates depends on how well the proposal distribution approximates
the true distribution. Therefore, for the proposed estimator to work well in practice, $q(y \mid x)$
should be close to $p(y \mid x)$. Note that a similar assumption is made when optimizing the
variational lower bound (9) during greedy learning.



3.2.3 Generalizations 

The definition of the estimator for two-layer DBNs readily extends to DBNs with L layers. If 
p(x) is the marginal density of a DBN whose layers are constituted by RBMs with densities 
q_1, ..., q_L and partition functions Z_1, ..., Z_L, and if we refer to the states of the random
vectors in each layer by x_0, ..., x_L, where x_0 contains the visible states and x_L contains the
states of the top hidden layer, then



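$$
\begin{aligned}
p(x_0) &= \sum_{x_1, \dots, x_{L-1}} q_L(x_{L-1}) \prod_{l=1}^{L-1} q_l(x_{l-1} \mid x_l) \\
&= \sum_{x_1, \dots, x_{L-1}} q_L(x_{L-1}) \prod_{l=1}^{L-1} q_l(x_l \mid x_{l-1}) \frac{q_l(x_{l-1})}{q_l(x_l)} \\
&= \sum_{x_1, \dots, x_{L-1}} \prod_{l=1}^{L-1} q_l(x_l \mid x_{l-1}) \frac{q_l^*(x_{l-1})}{q_l^*(x_l)} \cdot \frac{q_L^*(x_{L-1})}{Z_L}.
\end{aligned}
$$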



In order to estimate this term, hidden states x_1, ..., x_{L-1} are generated in a feed-forward
manner using the conditional distributions q_l(x_l | x_{l-1}). The weights are computed
along the way, then multiplied together and finally averaged over all drawn states.
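
The feed-forward procedure can be sketched for a DBN built from small binary RBMs. This is a minimal illustration under stated assumptions, not the implementation accompanying this article: the class `RBM`, the function names, the random parameters and the layer sizes are all hypothetical, and the partition function of the top layer is computed by brute force, which is only possible for tiny models.

```python
import itertools
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

class RBM:
    """Binary RBM with unnormalized joint exp(b.v + c.h + v^T W h)."""

    def __init__(self, n_visible, n_hidden, rng, scale=0.3):
        self.W = [[rng.uniform(-scale, scale) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [rng.uniform(-scale, scale) for _ in range(n_visible)]
        self.c = [rng.uniform(-scale, scale) for _ in range(n_hidden)]

    def hidden_input(self, v, j):
        return self.c[j] + sum(vi * row[j] for vi, row in zip(v, self.W))

    def visible_input(self, h, i):
        return self.b[i] + sum(w * hj for w, hj in zip(self.W[i], h))

    def unnorm_visible(self, v):
        # q*(v): hidden units summed out analytically
        val = math.exp(sum(bi * vi for bi, vi in zip(self.b, v)))
        for j in range(len(self.c)):
            val *= 1.0 + math.exp(self.hidden_input(v, j))
        return val

    def unnorm_hidden(self, h):
        # q*(h): visible units summed out analytically
        val = math.exp(sum(cj * hj for cj, hj in zip(self.c, h)))
        for i in range(len(self.b)):
            val *= 1.0 + math.exp(self.visible_input(h, i))
        return val

    def sample_hidden(self, v, rng):
        return tuple(int(rng.random() < sigmoid(self.hidden_input(v, j)))
                     for j in range(len(self.c)))

    def partition_function(self):
        # Brute force over all visible states; only feasible for tiny models.
        states = itertools.product((0, 1), repeat=len(self.b))
        return sum(self.unnorm_visible(v) for v in states)

def estimate_likelihood(rbms, x0, z_top, n_samples, rng):
    """Average of the importance weights accumulated during feed-forward sampling."""
    total = 0.0
    for _ in range(n_samples):
        x, weight = tuple(x0), 1.0
        for rbm in rbms[:-1]:
            x_next = rbm.sample_hidden(x, rng)                      # x_l ~ q_l(x_l | x_{l-1})
            weight *= rbm.unnorm_visible(x) / rbm.unnorm_hidden(x_next)
            x = x_next
        weight *= rbms[-1].unnorm_visible(x) / z_top                # q_L^*(x_{L-1}) / Z_L
        total += weight
    return total / n_samples

def total_estimated_mass(seed=0, n_samples=4000):
    """Sum of the estimates over all visible states of a tiny three-layer DBN."""
    rng = random.Random(seed)
    rbms = [RBM(3, 3, rng), RBM(3, 2, rng), RBM(2, 2, rng)]
    z_top = rbms[-1].partition_function()
    return sum(estimate_likelihood(rbms, x0, z_top, n_samples, rng)
               for x0 in itertools.product((0, 1), repeat=3))
```

Because the estimator is unbiased when Z_L is known exactly, the estimates summed over all visible states should be close to 1, which gives a simple sanity check for toy models.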

Often, a DBN not only contains RBMs but also more general distributions q(x, y) (see, 
for example, Roux et al., 2010; Osindero and Hinton, 2008; Ranzato et al., 2010; Ranzato 
and Hinton, 2010). In this case, analytical expressions of the unnormalized distribution 
over the hidden states q*(y) might be unavailable, as, for example, for the SRBM. If AIS or 
some other importance sampling method is used for the estimation of the partition function, 
however, the same importance samples and importance weights can be used in order to get 
unbiased estimates of q*(y), as we will show in the following. 

As in equation (11), let s be a proposal distribution and w be importance weights such 
that 

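$$\sum_x s(x) w(x) f(x) = \sum_x q^*(x) f(x)$$

for any function f. By noticing that q*(y) = Σ_x q*(x) q(y | x), it is easy to see how estimates
of q*(y) can be obtained using the same importance samples and importance weights which
are used for estimating the partition function,

$$q^*(y) \approx \frac{1}{N} \sum_{n=1}^{N} w(x_n) \, q(y \mid x_n),$$

where x_1, ..., x_N denote the importance samples drawn from s.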

As for the partition function, the importance weights only have to be generated once for all
x and all hidden states that are part of the evaluation. Estimating q*(y) in this manner,
however, introduces further bias into the estimator. Also note that a good proposal
distribution for estimating the partition function need not be a good proposal distribution for
estimating the marginals. The optimal proposal distribution for estimating the partition
function would be q(x), as in this case any importance weight would take on the value of the
partition function itself (12). The optimal proposal distribution for estimating the value of the
unnormalized marginal distribution q*(y), on the other hand, is q(x | y), which unfortunately
depends on y. Therefore, more importance samples will be needed in order to get reliable
estimates of the marginals.
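
The reuse of importance samples can be sketched as follows. The example assumes a tiny binary RBM with made-up parameters and, in place of AIS, a plain uniform proposal s(x), for which the weights w(x) = q*(x) / s(x) are available in closed form; all names are hypothetical. The same samples and weights yield both an estimate of the partition function and estimates of q*(y) for every hidden state y.

```python
import itertools
import math
import random

# Hypothetical parameters of a tiny binary RBM with 3 visible and 2 hidden units.
W = [[0.4, -0.2], [0.1, 0.3], [-0.3, 0.2]]
b = [0.1, -0.2, 0.0]
c = [-0.1, 0.2]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_input(x, j):
    return c[j] + sum(xi * row[j] for xi, row in zip(x, W))

def unnorm_visible(x):
    """q*(x), with the hidden units summed out analytically."""
    val = math.exp(sum(bi * xi for bi, xi in zip(b, x)))
    for j in range(len(c)):
        val *= 1.0 + math.exp(hidden_input(x, j))
    return val

def cond_hidden(y, x):
    """q(y | x), which factorizes over the hidden units."""
    prob = 1.0
    for j, yj in enumerate(y):
        p_on = sigmoid(hidden_input(x, j))
        prob *= p_on if yj else 1.0 - p_on
    return prob

def importance_samples(n_samples, rng):
    """Samples from a uniform proposal s(x) with weights w(x) = q*(x) / s(x)."""
    states = list(itertools.product((0, 1), repeat=len(b)))
    samples = [rng.choice(states) for _ in range(n_samples)]
    weights = [unnorm_visible(x) * len(states) for x in samples]
    return samples, weights

def estimate_partition_function(weights):
    return sum(weights) / len(weights)

def estimate_unnorm_hidden(y, samples, weights):
    """Estimate of q*(y) = sum_x q*(x) q(y | x) from the same samples and weights."""
    return sum(w * cond_hidden(y, x) for x, w in zip(samples, weights)) / len(weights)
```

A typical use would draw the samples once with `importance_samples` and then call `estimate_unnorm_hidden` for every hidden state that appears during the evaluation.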

3.3 Potential Log-Likelihood

In this section, we will discuss the concept of the potential log-likelihood, a concept which
appears in (Roux and Bengio, 2008). By considering a best-case scenario, the potential
log-likelihood gives an idea of the log-likelihood that can at best be achieved by training
additional layers using greedy learning. Its usefulness will become apparent in the
experimental section.

Let q(x, y) be the distribution of an already trained RBM or one of its generalizations,
and let r(y) be a second distribution, not necessarily the marginal distribution of any
Boltzmann machine. As in Section 2.3, r(y) serves to replace the prior distribution over the hidden
variables, q(y), and thereby improve the marginal distribution over x, Σ_y q(x | y) r(y). As
above, let p(x) denote the data distribution. Our goal, then, is to increase the expected
log-likelihood of the model distribution with respect to r,

$$\sum_x p(x) \log \sum_y q(x \mid y) r(y). \tag{20}$$

In applying the greedy learning procedure, we try to reach this goal by optimizing a lower 
bound on the log-likelihood (9), or equivalently, by minimizing the following KL-divergence: 



$$D_{\mathrm{KL}}\left[ \sum_x p(x) q(y \mid x) \,\Big\|\, r(y) \right] = -\sum_x p(x) \sum_y q(y \mid x) \log r(y) + \text{const},$$



where const is constant in r. 

The KL divergence is minimal if r(y) is equal to 

$$\sum_x p(x) q(y \mid x) \tag{21}$$

for every y. Since RBMs are universal approximators (Roux and Bengio, 2008), this
distribution could in principle be approximated arbitrarily well by a single, potentially very
large RBM (provided the y are binary).
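
That (21) is the minimizer can be checked in one step. Writing a(y) as a shorthand (introduced here only for this remark) for the distribution in (21), the part of the KL divergence that depends on r is a cross-entropy, which decomposes into an entropy and a KL term:

```latex
-\sum_x p(x) \sum_y q(y \mid x) \log r(y)
  = -\sum_y a(y) \log r(y)
  = H[a] + D_{\mathrm{KL}}\left[ a(y) \,\|\, r(y) \right]
  \geq H[a],
\qquad a(y) = \sum_x p(x) \, q(y \mid x).
```

Since H[a] does not depend on r and the KL divergence is nonnegative, the bound is attained if and only if r(y) = a(y) for every y.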

Assume that we have found this distribution, that is, we have maximized the lower 
bound with respect to all possible distributions r. The distribution for the DBN which we 
obtain by replacing r in (20) with (21) is then given by 

$$\sum_y q(x \mid y) \sum_{x_0} p(x_0) q(y \mid x_0) = \sum_{x_0} p(x_0) \sum_y q(x \mid y) q(y \mid x_0) = \sum_{x_0} p(x_0) q_0(x \mid x_0),$$

where we have used the reconstruction distribution

$$q_0(x \mid x_0) = \sum_y q(x \mid y) q(y \mid x_0),$$

which can be sampled from by conditionally sampling a state for the hidden units, and then,
given the state of the hidden units, conditionally sampling a reconstruction of the visible
units. The log-likelihood we achieve with this lower-bound optimal distribution is given by

$$\sum_x p(x) \log \sum_{x_0} p(x_0) q_0(x \mid x_0);$$

we will refer to this log-likelihood as the potential log-likelihood (and to the corresponding
log-loss as the potential log-loss). Note that the potential log-likelihood is not a true upper
bound on the log-likelihood that can be achieved with greedy learning, as suboptimal
solutions with respect to the lower bound might still give rise to higher log-likelihoods. However,
if such a solution were found, it would be by accident rather than by design. The
situation is depicted in the cartoon in Figure 4.

[Figure 4 plot: the log-likelihood and the lower bound, drawn as curves over the space of distributions r.]

Figure 4: A cartoon explaining the potential log-likelihood. The potential log-likelihood is
          the log-likelihood evaluated at (2), where the lower bound reaches its optimum (1).
          This does not exclude the existence of a distribution r for which the log-likelihood
          is larger than the potential log-likelihood, as in (3), but it is unlikely that this
          point will be found by greedy learning, which optimizes r only with respect to
          the lower bound.

4 Experiments 

In order to test the estimator, we considered the task of modeling 4x4 natural image patches 
sampled from the van Hateren dataset (van Hateren and van der Schaaf, 1998). We chose a 
small patch size to allow for a more thorough analysis of the estimator's behavior and the 
effects of certain model parameters. In all experiments, a standard battery of preprocessing 
steps was applied to the image patches, including a log-transformation, a centering step and 
a whitening step. Additionally, the DC component was projected out and only the other 
15 components of each image patch were used for training (for details, see Eichhorn et al., 
2009). 

In (Osindero and Hinton, 2008), a three-layer DBN based on GRBMs and SRBMs was
suggested for the modeling of natural image patches. The model employed a GRBM in the
first layer and SRBMs in the second and third layer. In contrast to samples from the same
model without lateral connections, samples from the proposed model were shown to possess
some of the statistical regularities also found in natural images, such as sparse distributions
of pixel intensities and the right pair-wise statistics of Gabor filter responses. Furthermore,
the first layer of the model was shown to develop oriented edge filters (Figure 1). In the
following, we will further analyse this type of model by estimating its likelihood.

    layers    true avg. log-loss    est. avg. log-loss
    1         2.0550777             2.0551289
    2         2.0550775             2.0550734
    3         2.0550773             2.0544256

Table 1: True and estimated log-loss of a small-scale version of the model. Adding more
         layers to the network does not help to improve the performance if the GRBM
         employs only few hidden units.

For training and evaluation, we used 10 independent pairs of training and test sets
containing 50000 samples each. We trained the models using the greedy learning procedure
described in Section 2.3. The scale parameter σ of the GRBM (5) was chosen via
cross-validation. After training a GRBM, we initialized the second-layer SRBM such that its
visible marginal distribution is equal to the hidden marginal distribution of the GRBM.
Initializing the second layer in this manner has the following advantages. First, after
initialization, the likelihood of the two-layer DBN consisting of the trained GRBM and the
initialized SRBM is equal to the likelihood of the GRBM. Second, the lower bound on the
DBN's log-likelihood (9) is equal to its actual log-likelihood. Using the notation of the
previous sections: 

As a consequence, an improvement in the lower bound necessarily leads to an improvement 
in the log-likelihood (Salakhutdinov, 2009). 

All trained models were evaluated using the proposed estimator. We used AIS in order 
to estimate the partition functions and the marginals of the SRBMs. Performances were 
measured as average log-loss in bits and normalized by the number of components. Details 
on the training and evaluation parameters can be found in Appendix A. 
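
The performance measure can be written down in a few lines. The sketch below assumes that per-sample log-likelihoods are available in nats (natural logarithm) and that each sample has 15 components, as after removing the DC component of a 4x4 patch; the function name is hypothetical.

```python
import math

def average_log_loss_bits(log_likelihoods_nats, n_components):
    """Average negative log-likelihood in bits, normalized by the number of components."""
    mean_ll = sum(log_likelihoods_nats) / len(log_likelihoods_nats)
    # Dividing by log(2) converts nats to bits.
    return -mean_ll / (math.log(2.0) * n_components)
```

For example, a model that assigns each 15-dimensional sample a log-likelihood of -15 log 2 nats has an average log-loss of 1 bit per component.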

4.1 Small Scale Experiment 

In a first experiment, we investigated a small-scale version of the model for which the 
likelihood is still tractable. It employed 15 hidden units in the first layer, 15 hidden units 
in the second layer and 50 hidden units in the third layer, where each layer was trained for 
50 epochs using CD(1). Brute-force and estimated results are given in Table 1. 

A first observation which can be made is that the estimated performance is very close
to the true performance. Another observation is that the second and third layer do not
help to improve the performance of the model, which hints at the fact that the 15 hidden
units of the GRBM are unable to capture much of the information in the continuous visible
units.

Figure 5: Left: A small-scale DBN was evaluated while either only estimating the
          partition function of the third-layer SRBM (orange curve) or estimating the hidden
          marginal distribution of the second-layer SRBM (blue curve), while using different
          numbers of AIS samples. The parameters of the AIS procedure were the same
          for both estimates. In particular, the same number of intermediate annealing
          distributions was used. Unsurprisingly, the estimated log-loss is more sensitive
          to the number of samples used for estimating the marginals. Right: The graph
          shows the estimated performance of DBN-100 while changing the number of
          importance samples used to estimate the marginals of the second-layer SRBM. The
          plot indicates that the true log-loss is still slightly larger than the estimates we
          obtained even after taking 10^5 samples.

In order to evaluate the likelihood of this model using the proposed estimator, the
unnormalized marginals of the second-layer SRBM's hidden units with respect to the SRBM
as well as the partition function of the third-layer SRBM had to be estimated. We
investigated the effect of the number of importance samples used in both estimates on the overall
estimate of the log-loss and made the following observations. First, almost no error could
be observed in the estimates of the partition function, and hence of the log-loss, even if
just one importance sample was used (left plot in Figure 5). This is the case if the proposal
distribution is very close to the true distribution, as can be seen from equation (12) by
replacing the former with the latter. However, the reason for this observation is likely to
be found in the small model size and the fact that the third layer contributes virtually
nothing to an explanation of the data. As the model becomes larger, more samples will be
required. Second, as expected, many more samples are needed for a satisfactory
approximation of the marginals. Using too few samples led to overestimation of the likelihood and
underestimation of the log-loss, respectively.

Figure 6: A comparison of different models. For each model, the estimated log-loss in bits
          per data component is shown, averaged over 10 independent trials with
          independent training and test sets. The number behind each model indicates either
          the number of hidden units or the number of mixture components used. All
          GRBMs and DBNs were trained with CD(1). Larger values correspond to worse
          performance.



4.2 Model Comparison 

In a next experiment, we compared the performance of a larger instantiation of the model to 
the performance of linear ICA (Eichhorn et al., 2009) as well as several mixture distributions. 
The model employed 100 hidden units in each layer and each layer was trained for 100 
epochs. As in (Osindero and Hinton, 2008), CD(1) was used to train the layers. 

Perhaps closest in interpretation to the GRBM as well as to the DBN is the mixture of 
isotropic Gaussian distributions (MoIG) with identical covariance and varying mean. Note 
that after the parameters of the GRBM have been fixed, adding layers to the DBN only 
affects the prior distribution over the means learned by the GRBM, but has no effect on 
their positions. As for the GRBM, the scale parameter common to all Gaussian mixture 
components was chosen by cross-validation. Other models taken into account are mixtures of 
Gaussians with unconstrained covariance but zero mean (MoG), and mixtures of elliptically 
contoured Gamma distributions with zero mean (MoEC) (Hosseini and Bethge, 2007). 

The results in Figure 6 suggest that mixture components with freely varying covariance 
are better suited for capturing the structure of 4x4 image patches than mixture components 
with fixed covariance. Strikingly, the DBN with 100 hidden units in each layer yielded an 
even larger log-loss than the MoG-2 model. On the other hand, both the DBN and the 
GRBM outperform the MoIG-100 model, which in contrast to MoG-2 adjusted the means 
but not the covariance. 

Figure 7: Estimated performance of three DBN-100 models trained with different learning
          rules. The improvement per layer decreases as each layer is trained more
          thoroughly. For each learning rule, out of 10 trials, only the trial with the best
          performance is shown. The dashed lines indicate the estimated potential log-loss
          of the first-layer GRBM.

Due to the need to estimate the SRBM's marginals, the estimate of the DBN's performance
might still be too optimistic. As the right plot in Figure 5 indicates, the true log-loss
is likely to be a bit larger. Also note that by using more hidden units, the performance 
of both the GRBM and the DBN might still improve. Of course, the same is true for the 
mixture models, whose performance might also be improved by taking more components. 

Without lateral connections, that is, with RBMs instead of SRBMs, adding layers to the 
network only decreased the overall performance. For a model with 100 hidden units in each 
layer, trained with CD(1) and the same learning parameters as for the model with lateral 
connections, we estimated the average log-loss to be approximately 1.945±4.3E-3 (mean ± 
SEM, averaged over 10 trials). This suggests that the lateral connections did indeed help 
to improve the performance of the model. 

4.3 Effect of Additional Layers 

Using better approximations to ML learning by taking larger CD parameters led to an 
improved performance of the GRBM. However, the same could not be observed for the 
three-layer DBN, whose estimated performance was almost the same for all tested CD 
parameters (Figure 7). In other words, adding layers to the network was less effective if 
each layer was trained more thoroughly. 

In many cases, adding a third layer even worsened the performance if the model was
trained with CD(5) or CD(10). A likely cause of this behavior is an overly large learning
rate, leading to a divergence of the training process. In Figure 7, only the best results,
for which the training converged, are shown.

Figure 8: Estimated potential log-loss. Each graph represents the estimated potential
          log-loss of one GRBM, averaged over 10 estimates with different test sets. The size
          of the data sets used in the estimates is given on the horizontal axis. Error
          bars indicate one standard deviation. After 50000 samples, the estimates of the
          potential log-loss have still not converged.

The estimated improvement of the three-layer DBN over the GRBM is about 0.1 bit
when trained with CD(5) or CD(10). An important question to ask is why the improvement
per added layer is so small. Insight into this question might be gained by evaluating the
potential log-likelihood of the GRBM, which represents a practical limit to the performance
that can be achieved by means of greedy learning and can in principle be evaluated even
before training any additional layers. If the potential log-loss of a trained GRBM is close
to its log-loss, adding layers is a priori unlikely to prove useful. However, exact evaluation
of the potential log-likelihood is intractable, as it involves two nested integrals with respect
to the data distribution,

$$\int p(x) \log \int p(x_0) \, q_0(x \mid x_0) \, dx_0 \, dx.$$

Nevertheless, using optimistic estimates, we were still able to infer something about the
DBN's capability to improve over the GRBM: We estimated the potential log-likelihood
using the same set of data samples to approximate both integrals, thereby encouraging
optimistic estimation. Note that estimating the potential log-likelihood in this manner is
similar to evaluating the log-likelihood of a kernel density estimate on the training data,
although the reconstruction distribution q_0(x | x_0) might not correspond to a valid kernel.
Also note that by taking more and more data samples, the estimate of the potential log-loss
should become more and more accurate. Figure 8 indicates that the potential log-loss of a
GRBM with 100 hidden units and trained with CD(1) is at least 1.66, which is
still worse than the performance of, for example, the mixture of Gaussians with
5 components.
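
The plug-in estimate described above can be sketched as follows. The code assumes that the log of the reconstruction density, log q_0(x | x_0), can be evaluated by a user-supplied function; the isotropic Gaussian used here is only a stand-in for the reconstruction distribution of a trained GRBM, and all function names are hypothetical.

```python
import math

def log_mean_exp(values):
    """Numerically stable logarithm of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def estimate_potential_log_likelihood(data, log_recon_density):
    """Plug-in estimate that uses the same samples for the outer and the inner average.

    log_recon_density(x, x0) should return log q_0(x | x0).
    """
    total = 0.0
    for x in data:
        # Inner average over x0 approximates the inner integral, including x0 = x itself.
        total += log_mean_exp([log_recon_density(x, x0) for x0 in data])
    return total / len(data)

def gaussian_log_kernel(x, x0, sigma=1.0):
    """Stand-in for log q_0(x | x0): an isotropic Gaussian centered at x0 (an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x0))
    return -0.5 * sq_dist / sigma ** 2 - 0.5 * len(x) * math.log(2.0 * math.pi * sigma ** 2)
```

Because each sample also appears in its own inner average, the estimate is optimistic, in the same way that a kernel density estimate evaluated on its own training data is.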

Ideally, while training the first layer, one would like to take into account the fact that 
more layers will be added to the network. The potential log-loss suggests a regularization 
which minimizes the reconstruction error. Given that a model with perfect reconstruction is 
a fixed point of CD learning (Roux and Bengio, 2008) and considering the fact that a DBN 
trained with CD(1) led to the same performance as a DBN trained with CD(10) (Figure 7), 
one might hope that CD already has such a regularizing effect. As the left plot in Figure 8
shows, however, this could not be confirmed: Better approximations to ML learning led to
a better estimated potential log-loss.

5 Discussion 

We have shown how the likelihood of DBNs can be estimated in a way that is both tractable 
and simple enough to be used in practice. Reliable estimators for the likelihood are an 
important tool not only for the evaluation of models deployed in density estimation tasks, 
but also for the evaluation of the effect of different training settings and learning rules 
which try to optimize the likelihood. Thus, the introduced estimator potentially adds to the 
toolbox of everyone training DBNs and facilitates the search for better learning algorithms 
by allowing one to evaluate their effect on the likelihood directly. 

However, in cases where models with intractable unnormalized marginal distributions 
are used to build up a DBN, estimating the likelihood of DBNs with three or more layers 
is still a difficult problem. More efficient ways to estimate the unnormalized marginals will 
be required if the proposed estimator is to be used with much larger models than the ones 
discussed in this article. In the common case where a DBN is solely based on RBMs, this 
problem does not occur and the estimator is readily applicable. 

We have provided evidence that a particular DBN is not very well suited for the task 
of modeling natural image patches if the goal is to do density estimation. Furthermore, 
we have shown that adding layers to the network improves the overall performance of the 
model only by a small margin, especially if the lower layers are trained thoroughly. By 
estimating the potential log-loss, a joint property of the trained first-layer model and the
greedy learning procedure, we showed that even with a lower-bound optimal model in the
second layer, the overall performance of the DBN would have been unlikely to be much 
better. 

The potential log-loss suggests two possible ways to improve the training procedure: On
the one hand, the lower layers might be regularized so as to keep the potential improvement 
that can be achieved with greedy learning large. On the other hand, the lower bound 
optimized during greedy learning might be replaced with a different objective function 
which represents a better approximation to the true likelihood. Future research will have 
to show whether these approaches are feasible and can lead to measurable improvements. 

The research on hierarchical models of natural images is still in its infancy. Although 
several other attempts have been made to create multi-layer models of natural images 
(Sinz et al., 2010; Köster and Hyvärinen, 2010; Hinton et al., 2006b; Karklin and Lewicki,
2005), these models have either been (by design) limited to two layers, or a substantial 
improvement beyond two layers has not been found. Instead, the optimization and creation 
of new shallow architectures has so far proven more fruitful. It remains to be seen whether 
this apparent limitation of hierarchical models will be overcome by, for example, creating 
models and more efficient learning procedures that can be used with larger patch sizes, 
or whether this observation is due to a more fundamental problem related to the task of 
estimating the density of natural images. 

Figure 9: Log-loss of a GRBM-100 versus number of epochs, averaged over 10 trials using
          the same training and test sets in each trial. After 50 epochs, the log-loss has
          largely converged. No overfitting could be observed. Using a constant learning
          rate instead of a linearly decreasing learning rate had no effect on the convergence,
          which means that the convergence is not just due to the annealing.



In the following, we summarize the learning and evaluation parameters used in the
experiments of Section 4.

The layers of the deep belief network with 100 hidden units were trained for 100 epochs.
The learning rates were decreased from 1 · 10^-2 to 1 · 10^-4 during training using a linear
annealing schedule. As can be seen in Figure 9, the performance of the GRBM largely
converged after 50 epochs.

The covariance of the conditional distribution of the GRBM's visible units given the
hidden units was fixed to σI, where σ was treated as a hyperparameter and chosen via
cross-validation with respect to the likelihood of the GRBM after all other parameters had been
fixed. Weight decay of 0.01 times the learning rate was applied to all weights, but not to the
biases, and a momentum factor of 0.9 was used for all parameters. The biases of the hidden
units of all layers were initialized to -1 as a (rough) means to encourage sparseness.

As described in Section 4, the second-layer SRBM was initialized so that its marginal
distribution over the units it shares with the GRBM is the same as the marginal distribution
defined by the GRBM. During training, approximate samples from the visible conditional
distribution of the SRBMs were obtained using 20 parallel mean field updates with a
damping parameter of 0.2 (Welling and Hinton, 2002). During evaluation, sequential Gibbs
updates were used.

Figure 10: Joint evaluation of the number of hidden units and the component variance. By
           taking more hidden units and smaller variances, the performance of the GRBM
           can still be improved. All models were trained for 50 epochs using CD(1).

For the evaluation of the partition function and the marginals, we used AIS. The number 
of intermediate annealing distributions was 1000 in each layer. We used a linear annealing 
schedule, that is, the annealing weights determining the intermediate distributions were 
equally spaced. Though this schedule is not optimal from a theoretical perspective (Neal, 
2001), we only found a small effect on the estimator's performance by taking different 
schedules. The number of AIS samples used during the experiments was 100 for the GRBM, 
1000 for the third-layer SRBM and 100000 for the second-layer SRBM. The number of 
second-layer AIS samples had to be much larger because the samples were used not only 
to estimate the partition function, but also to estimate the second-layer SRBM's hidden 
marginals. As can be inferred from Figure 5, even after taking this many samples the 
estimates of the three-layer DBN's performance were still somewhat optimistic. 
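
For readers unfamiliar with AIS, the role of the equally spaced annealing weights can be sketched on a toy problem. The example below is not the procedure used for the SRBMs: it assumes a small discrete state space, a uniform base distribution, a single Metropolis step per intermediate distribution, and a made-up target; all names and values are hypothetical.

```python
import math
import random

def ais_log_weight(log_target, n_states, betas, rng):
    """One AIS run from a uniform base distribution over {0, ..., n_states - 1}.

    Intermediate distributions are f_k(x) = f_0(x)^(1 - beta_k) * f(x)^beta_k with
    f_0(x) = 1 / n_states, so the partition function of the base distribution is 1.
    """
    log_base = -math.log(n_states)
    x = rng.randrange(n_states)  # exact sample from the base distribution
    log_w = 0.0
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        # Importance weight update: log f_k(x) - log f_{k-1}(x).
        log_w += (beta - beta_prev) * (log_target(x) - log_base)
        # One Metropolis step leaving f_k invariant (uniform independence proposal).
        x_new = rng.randrange(n_states)
        log_accept = beta * (log_target(x_new) - log_target(x))
        if rng.random() < math.exp(min(0.0, log_accept)):
            x = x_new
    return log_w

def ais_partition_function(log_target, n_states, n_temperatures, n_runs, rng):
    """Average of AIS importance weights; estimates sum_x exp(log_target(x))."""
    betas = [k / n_temperatures for k in range(n_temperatures + 1)]  # equally spaced
    weights = [math.exp(ais_log_weight(log_target, n_states, betas, rng))
               for _ in range(n_runs)]
    return sum(weights) / n_runs

def toy_log_target(x):
    # Hypothetical unnormalized log-probability peaked at state 7.
    return -0.5 * (x - 7) ** 2
```

Calling `ais_partition_function(toy_log_target, 10, 100, 1000, random.Random(0))` gives an estimate that can be compared against the exact sum over the ten states; replacing the list `betas` is all that is needed to try a different annealing schedule.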

Lastly, note that the performance of the GRBM and the DBN might still be improved 
by taking a larger number of hidden units. A post-hoc analysis revealed that the GRBM 
does indeed not overfit but continues to improve its performance if the variance is decreased 
while increasing the number of hidden units (Figure 10). 

Code for training and evaluating deep belief networks using the estimator presented in
this article can be found under

http://kyb.tuebingen.mpg.de/bethge/code/dbn/dbn.tar.gz.